When a human communicates with a machine using natural language on the web, how can the machine understand the human's intention and the semantic context of what is said? This is an important AI task, as it enables the machine to construct a sensible answer or perform a useful action for the human. Meaning is represented at the sentence level, whose identification is known as intent detection, and at the word level, a labelling task called slot filling. This dual-level joint task requires innovative thinking about natural language and deep learning network design, and as a result many approaches and models have been proposed and applied. This tutorial will discuss how the joint task is set up and will introduce Spoken Language Understanding/Natural Language Understanding (SLU/NLU) with deep learning techniques. We will cover the datasets, experiments and metrics used in the field. We will describe how the machine uses the latest NLP and deep learning techniques to address the joint task, including recurrent and attention-based Transformer networks and pre-trained models (e.g. BERT). We will then look in detail at a network that allows the two levels of the task, intent classification and slot filling, to interact explicitly to boost performance. We will close with a code demonstration of a Python notebook for this model, giving attendees an opportunity to follow a live coding demo of this joint NLU model and further their understanding.
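As a concrete orientation, below is a minimal sketch of such a joint architecture: a shared pre-trained BERT encoder with a sentence-level intent head and a token-level slot head. The label counts, checkpoint name and loss combination are illustrative assumptions, not the tutorial notebook's exact code.

```python
# Minimal sketch of a joint intent-detection / slot-filling model on a shared
# BERT encoder (hypothetical label counts; not the tutorial's exact notebook).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class JointNLU(nn.Module):
    def __init__(self, num_intents=7, num_slots=30, name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)  # sentence-level
        self.slot_head = nn.Linear(hidden, num_slots)      # token-level

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        intent_logits = self.intent_head(out.last_hidden_state[:, 0])  # [CLS]
        slot_logits = self.slot_head(out.last_hidden_state)            # each token
        return intent_logits, slot_logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["book a flight to sydney"], return_tensors="pt")
model = JointNLU()
intent_logits, slot_logits = model(batch["input_ids"], batch["attention_mask"])
# Training would typically sum an intent cross-entropy and a slot cross-entropy:
# loss = ce(intent_logits, intent_labels) + ce(slot_logits.flatten(0, 1), slot_labels.flatten())
```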
Most TextVQA approaches focus on integrating objects, scene texts and question words with a simple transformer encoder, but this fails to capture the semantic relations between the different modalities. This paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among objects, Optical Character Recognition (OCR) tokens and question words. This is achieved with a TextVQA-based scene graph that discovers the underlying semantics of an image. We create a guided-attention module to capture the intra-modal interplay between language and vision as guidance for inter-modal interactions. To teach the relations between the two modalities explicitly, we propose and integrate two attention modules, namely a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conduct extensive experiments on two benchmark datasets, Text-VQA and ST-VQA, and show that our SceneGATE method outperforms existing approaches thanks to the scene graph and its attention modules.
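To illustrate the guided-attention idea in isolation, here is a minimal cross-modal attention sketch in which question tokens attend over visual/OCR region features; the dimensions and naming are illustrative assumptions, and this is not the authors' released SceneGATE implementation.

```python
# Minimal guided attention between modalities: text queries, visual keys/values.
import torch
import torch.nn as nn

class GuidedAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, text_feats, visual_feats):
        # text_feats: (B, T, dim) question tokens; visual_feats: (B, R, dim) regions/OCR
        q = self.q_proj(text_feats)
        k = self.k_proj(visual_feats)
        v = self.v_proj(visual_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, T, R)
        return attn @ v  # text tokens re-expressed via attended visual content

module = GuidedAttention()
fused = module(torch.randn(2, 12, 512), torch.randn(2, 36, 512))
print(fused.shape)  # torch.Size([2, 12, 512])
```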
We propose PiggyBack, a Visual Question Answering platform that allows users to apply state-of-the-art visual-language pretrained models easily. PiggyBack supports the full stack of visual question answering tasks, specifically data processing, model fine-tuning, and result visualisation. We integrate visual-language models pretrained and distributed through HuggingFace, an open-source API platform for deep learning technologies; on their own, however, these models cannot be used without programming skills or an understanding of deep learning. Hence, PiggyBack provides an easy-to-use, browser-based user interface with several deep learning visual-language pretrained models for general users and domain experts. PiggyBack offers the following benefits: free availability under the MIT License; portability, since it is web-based and thus runs on almost any platform; a comprehensive data creation and processing pipeline; and ease of use with deep learning-based visual-language pretrained models. The demo video is available on YouTube at https://youtu.be/iz44RZ1lF4s.
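For readers who do program directly, the snippet below shows the kind of HuggingFace call that PiggyBack wraps behind its browser interface; the checkpoint is one public VQA example, not necessarily the model PiggyBack ships with, and the image path is a placeholder.

```python
# Running a pretrained visual-language VQA model through the HuggingFace API.
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")  # one public example
answer = vqa(image="photo.jpg",                          # placeholder image path
             question="What colour is the car?")
print(answer)  # e.g. [{'answer': 'red', 'score': 0.87}, ...]
```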
In-game toxic language has become a pressing issue in the gaming industry and community, and several online game toxicity analysis frameworks and models have been proposed. However, detecting toxicity remains challenging due to the nature of in-game chat, which is extremely short. In this paper, we describe how the in-game toxic language shared task was established using real-world in-game chat data. In addition, we propose and introduce a model/framework for toxic language token tagging (slot filling) from in-game chat. The data and code will be released.
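Framed as slot filling, toxic-token tagging reduces to standard token classification; the sketch below illustrates this with a hypothetical BIO label set and a generic checkpoint, not the shared task's released resources.

```python
# Toxic-token tagging as token classification (hypothetical BIO labels).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-TOXIC", "I-TOXIC"]  # illustrative label scheme
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

batch = tok(["gg ez noob team"], return_tensors="pt")  # very short in-game chat
with torch.no_grad():
    logits = model(**batch).logits                     # (1, seq_len, 3)
pred = [labels[i] for i in logits.argmax(-1)[0]]
print(list(zip(tok.convert_ids_to_tokens(batch["input_ids"][0]), pred)))
```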
Scene Graph Generation (SGG) provides a comprehensive representation of images for human understanding as well as for visual understanding tasks. Due to the long-tail bias of the object and predicate labels in the available annotated data, scene graphs generated by current methodologies tend to be biased toward common, non-informative relationship labels. Relationships can also be non-mutually exclusive and can be described from multiple perspectives, such as geometrical or semantic relationships, making it even more challenging to predict the most suitable relationship label. In this work, we propose the SG-Shuffle pipeline for scene graph generation with three components: 1) a Parallel Transformer Encoder, which learns to predict object relationships in a more exclusive manner by grouping relationship labels into groups of similar purpose; 2) a Shuffle Transformer, which learns to select the final relationship labels from the category-specific features generated in the previous step; and 3) a weighted cross-entropy loss, used to alleviate the training bias caused by the imbalanced dataset.
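The third component can be illustrated in a few lines: rarer relationship classes receive larger loss weights so the long tail is not drowned out by frequent labels. The inverse-frequency weighting below is an illustrative choice, not necessarily the paper's exact scheme.

```python
# Weighted cross-entropy with inverse-frequency class weights.
import torch
import torch.nn as nn

class_counts = torch.tensor([9000., 700., 50.])  # hypothetical predicate frequencies
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)              # predictions over 3 predicate classes
targets = torch.randint(0, 3, (8,))
loss = criterion(logits, targets)       # misclassifying rare classes costs more
```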
The collaborative filtering problem is typically solved with matrix completion techniques, which recover the missing values of the user-item interaction matrix. In the matrix, a rated position specifically represents a given user and rating value. Previous matrix completion techniques tend to ignore the position of each element (user, item and rating) in the matrix and focus mainly on the semantic similarity between users and items to predict the missing values. This paper proposes a novel position-enhanced user/item representation training model for recommendation, Super-Rec. We first use relative positional rating encoding to encode and store the position-enhanced rating information and its user-item relationship with a fixed embedding size, unaffected by the size of the matrix. We then apply the trained position-enhanced user and item representations to the simplest traditional machine learning models, in order to highlight the pure novelty of our representation model. We provide the first formal introduction and quantitative analysis of position-enhanced item representations in the recommendation domain, together with a principled discussion of our Super-Rec, which outperforms on typical collaborative filtering recommendation tasks with both explicit and implicit feedback.
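The core idea of a fixed-size, matrix-size-independent positional encoding might be sketched as follows, where absolute row/column indices are mapped into a fixed number of relative buckets; the bucketing scheme and module names are illustrative assumptions, not the paper's exact encoding.

```python
# Sketch: combine a rating embedding with bucketed relative-position embeddings,
# so the representation size is fixed regardless of the matrix dimensions.
import torch
import torch.nn as nn

class PositionEnhancedRating(nn.Module):
    def __init__(self, num_ratings=5, num_buckets=32, dim=64):
        super().__init__()
        self.rating_emb = nn.Embedding(num_ratings + 1, dim)
        self.row_pos_emb = nn.Embedding(num_buckets, dim)  # relative user position
        self.col_pos_emb = nn.Embedding(num_buckets, dim)  # relative item position
        self.num_buckets = num_buckets

    def forward(self, rating, row, col, n_rows, n_cols):
        # Map absolute indices to fixed buckets, independent of matrix size.
        row_b = (row.float() / n_rows * self.num_buckets).long().clamp(max=self.num_buckets - 1)
        col_b = (col.float() / n_cols * self.num_buckets).long().clamp(max=self.num_buckets - 1)
        return self.rating_emb(rating) + self.row_pos_emb(row_b) + self.col_pos_emb(col_b)

enc = PositionEnhancedRating()
vec = enc(torch.tensor([4]), torch.tensor([123]), torch.tensor([7]), 10000, 2000)
print(vec.shape)  # torch.Size([1, 64])
```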
Text-based games (TBGs) are complex environments that allow users or computer agents to interact through text and achieve game goals. Building goal-oriented computer agents for text-based games is challenging, especially when we use step-wise feedback as the model's only textual input. Moreover, it is difficult for agents to evaluate action candidates of flexible length and form drawn from a large textual input space. In this paper, we provide an extensive analysis of deep learning methods applied to the field of text-based games.
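The agent-environment loop that the surveyed methods share can be summarised in a short skeleton: the agent sees only step-wise textual feedback and must score action candidates of flexible length and form. The toy environment and random scorer below are placeholders, not any specific surveyed model.

```python
# Generic text-based game loop with a placeholder action scorer.
import random

class ToyTBG:
    """A two-step stand-in for a text-based game environment."""
    def reset(self):
        self.steps = 0
        return "You are in a dark room. There is a door to the north."
    def admissible_actions(self):
        return ["go north", "look around", "open the door slowly"]
    def step(self, action):
        self.steps += 1
        done = self.steps >= 2
        return f"You {action}.", (1.0 if done else 0.0), done

def agent_score(observation: str, action: str) -> float:
    return random.random()  # placeholder for a learned scorer (e.g. LSTM/Transformer)

env = ToyTBG()
obs, done = env.reset(), False
while not done:
    action = max(env.admissible_actions(), key=lambda a: agent_score(obs, a))
    obs, reward, done = env.step(action)  # step-wise textual feedback only
    print(obs, reward)
```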
Online hate speech detection has become important with the growth of digital devices, but resources in languages other than English are severely limited. We introduce K-MHaS, a new multi-label dataset for hate speech detection that effectively handles Korean language patterns. The dataset consists of 109k utterances from news comments and provides multi-label classification with 1 to 4 labels per utterance, handling subjectivity and intersectionality. We evaluate strong baselines on K-MHaS. KR-BERT with a sub-character tokenizer performs best, recognising decomposed characters in each hate speech class.
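Because each utterance can carry one to four labels, the task is multi-label rather than single-label classification; the sketch below shows this setup with a sigmoid/BCE head. The label list and checkpoint are illustrative assumptions, not the exact K-MHaS resources.

```python
# Multi-label hate speech classification: sigmoid head, BCE-with-logits loss.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["origin", "physical", "politics", "profanity",
          "age", "gender", "race", "religion"]          # hypothetical label set
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(labels),
    problem_type="multi_label_classification")          # uses BCE-with-logits

batch = tok(["예시 문장"], return_tensors="pt")          # example Korean utterance
probs = torch.sigmoid(model(**batch).logits)
predicted = [l for l, p in zip(labels, probs[0]) if p > 0.5]  # any subset of labels
```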
Recognising the layout of unstructured digital documents is crucial when parsing documents into a structured, machine-readable format for downstream applications. Recent studies in document layout analysis usually rely on computer vision models to understand documents, while ignoring other information, such as contextual information or the relationships between document components, which is vital to capture. Our Doc-GCN presents an effective way to harmonise and integrate heterogeneous aspects for document layout analysis. We first construct graphs to explicitly describe four main aspects, including syntactic, semantic, density and appearance/visual information. We then apply graph convolutional networks to represent each aspect of information and use pooling for integration. Finally, we aggregate all aspects and feed them into a 2-layer MLP for document layout component classification. Our Doc-GCN achieves new state-of-the-art results on three widely used DLA datasets.
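The graph-convolution step that propagates information between document components can be sketched as a vanilla GCN layer with mean aggregation; Doc-GCN's exact layers and pooling may differ.

```python
# A basic GCN layer over document-component nodes (illustrative, not Doc-GCN's code).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) adjacency with self-loops
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.linear((adj / deg) @ x))  # mean aggregation

nodes = torch.randn(6, 128)          # e.g. 6 layout components on a page
adj = torch.eye(6)
adj[0, 1] = adj[1, 0] = 1.0          # a hypothetical edge between two components
layer = GCNLayer(128, 64)
print(layer(nodes, adj).shape)       # torch.Size([6, 64])
```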
Attention mechanisms have been used as an important component across vision-and-language (VL) tasks to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, the ability of different attention alignment calculations to bridge the semantic gap between visual and textual clues has not been studied. In this study, we conduct a comprehensive analysis of the role of attention alignment by examining attention score calculation methods and checking how they actually represent the importance of visual regions and textual tokens for the global assessment. We also analyse under which conditions attention score calculation mechanisms are more (or less) interpretable, and how they may affect model performance on three different VL tasks, including visual question answering, text-to-image generation, and text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and shows the importance of each attention alignment score calculation applied in the training phase of VL tasks, which is usually neglected in attention-based cross-modal models and/or pre-trained models.
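The alignment score functions under comparison can be written out in minimal form, e.g. dot-product, scaled dot-product and additive scoring; the dimensions below are illustrative, not the paper's experimental code.

```python
# Three common attention alignment score calculations between modalities.
import torch
import torch.nn as nn

d = 64
q = torch.randn(1, 10, d)   # e.g. text tokens
k = torch.randn(1, 36, d)   # e.g. visual regions

dot = q @ k.transpose(-2, -1)                       # dot product
scaled = dot / d ** 0.5                             # scaled dot product
W_q, W_k = nn.Linear(d, d), nn.Linear(d, d)
v = nn.Linear(d, 1)
additive = v(torch.tanh(W_q(q).unsqueeze(2) + W_k(k).unsqueeze(1))).squeeze(-1)

for name, score in [("dot", dot), ("scaled", scaled), ("additive", additive)]:
    attn = torch.softmax(score, dim=-1)             # (1, 10, 36) alignment weights
    print(name, attn.shape)
```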